Data Exploration Project - Pumps¶

Lena Hammerer, Timo Heiß, Fabian Lulikat

Business Understanding¶

Problemstellung: Gegeben sind Daten über verschiedene Brunnen in Tansania.

Zielstellung: Ziel ist die Vorhersage des Wartungszustands ("Status") der Brunnen mithilfe von Machine Learning Modellen. Die Zielgröße ist damit die Spalte "status_group" mit den Ausprägungen "functional", "functional needs repair" und "non functional". Spezifischer sollen drei Modelle erstellt werden:

  • Prognosemodell, welche Brunnen funktionieren und welche nicht
  • Prognosemodell für die Brunnen die repariert werden müssen
  • Prognosemodell, das zwischen allen drei Klassen unterscheidet

Damit handelt es sich um Klassifikationsprobleme (zwei binäre, ein multinominales) des überwachten Lernens.

Metrik: Die Güte der Modelle soll anhand des AUC (Area Under Curve) der Testdaten gemessen werden.

Ziel ist es also ein vollständiges Notebook zu erstellen vom Datenimport, über EDA zur Aufbereitung der Daten, der Modellbildung, Analyse und der Validierung.

Data Understanding¶

Einlesen der Trainings- und Testdaten.¶

In [2]:
df_pump_train = pd.read_csv('pump_train.csv')
df_pump_test = pd.read_csv('pump_test.csv')
In [12]:
df_pump_train.head()
Out[12]:
id amount_tsh date_recorded funder gps_height installer longitude latitude wpt_name num_private ... water_quality quality_group quantity quantity_group source source_type source_class waterpoint_type waterpoint_type_group status_group
0 12237 30.0 2013-01-23 Government Of Tanzania 107 District Council 39.360880 -10.295705 Zahanati Membe 0 ... soft good enough enough river river/lake surface communal standpipe multiple communal standpipe functional
1 24566 0.0 2013-02-07 Private Individual 0 Edward 32.644074 -3.859265 Kwa Mama Wini 0 ... soft good insufficient insufficient shallow well shallow well groundwater other other functional
2 20536 0.0 2011-07-15 Sawaka 0 DWE 30.999792 -1.721771 Kwasimon 0 ... soft good enough enough shallow well shallow well groundwater other other non functional
3 30633 200.0 2011-03-25 Government Of Tanzania 2142 Commu 34.060324 -9.335288 none 0 ... soft good enough enough spring spring groundwater communal standpipe communal standpipe functional
4 2993 0.0 2011-02-17 African Muslim Agency 290 District Council 38.862874 -7.165410 Msikitini 0 ... soft good dry dry machine dbh borehole groundwater hand pump hand pump non functional

5 rows × 41 columns

In [107]:
df_pump_test.head()
Out[107]:
id amount_tsh date_recorded funder gps_height installer longitude latitude wpt_name num_private ... water_quality quality_group quantity quantity_group source source_type source_class waterpoint_type waterpoint_type_group status_group
0 69572 6000.0 2011-03-14 Roman 1390 Roman 34.938093 -9.856322 none 0 ... soft good enough enough spring spring groundwater communal standpipe communal standpipe functional
1 8776 0.0 2013-03-06 Grumeti 1399 GRUMETI 34.698766 -2.147466 Zahanati 0 ... soft good insufficient insufficient rainwater harvesting rainwater harvesting surface communal standpipe communal standpipe functional
2 49056 0.0 2011-02-20 Private 62 Private 39.209518 -7.034139 Mzee Hokororo 0 ... salty salty enough enough machine dbh borehole groundwater other other functional
3 58155 0.0 2011-09-04 Unicef 1656 DWE 34.569266 -9.085515 Kwa Rose Chaula 0 ... soft good dry dry river river/lake surface communal standpipe communal standpipe non functional
4 34169 0.0 2011-07-22 Hesawa 1162 DWE 32.920154 -1.947868 Ngomee 0 ... milky milky insufficient insufficient spring spring groundwater other other functional needs repair

5 rows × 41 columns

Data Dictionary¶

Variable Beschreibung
amount_tsh Total static head (amount water available to waterpoint)
date_recorded The date the row was entered
funder Who funded the well
gps_height Altitude of the well
installer Organization that installed the well
longitude GPS coordinate
latitude GPS coordinate
wpt_name Name of the waterpoint if there is one
num_private ?
basin Geographic water basin
subvillage Geographic location
region Geographic location
region_code Geographic location (coded)
district_code Geographic location (coded)
lga Geographic location
ward Geographic location
population Population around the well
public_meeting True/False
recorded_by Group entering this row of data
scheme_management Who operates the waterpoint
scheme_name Who operates the waterpoint
permit If the waterpoint is permitted
construction_year Year the waterpoint was constructed
extraction_type The kind of extraction the waterpoint uses
extraction_type_group The kind of extraction the waterpoint uses
extraction_type_class The kind of extraction the waterpoint uses
management How the waterpoint is managed
management_group How the waterpoint is managed
payment What the water costs
payment_type What the water costs
water_quality The quality of the water
quality_group The quality of the water
quantity The quantity of water
quantity_group The quantity of water
source The source of the water
source_type The source of the water
source_class The source of the water
waterpoint_type The kind of waterpoint
waterpoint_type_group The kind of waterpoint
status_group functional or non-functional or functional needs repair

Explorative Data Analysis (EDA)¶

In [108]:
df_pump_train.shape
Out[108]:
(50490, 41)
  • große Anzahl an Features
  • viele Datenpunkte
In [109]:
df_pump_train.describe().T
Out[109]:
count mean std min 25% 50% 75% max
id 50490.0 37113.857695 21447.241039 0.000000 18517.250000 37027.500000 55636.750000 7.424700e+04
amount_tsh 50490.0 319.158123 2987.517185 0.000000 0.000000 0.000000 20.000000 3.500000e+05
gps_height 50490.0 669.714280 693.123330 -90.000000 0.000000 371.000000 1322.000000 2.770000e+03
longitude 50490.0 34.078591 6.553382 0.000000 33.084409 34.906548 37.178899 4.034519e+01
latitude 50490.0 -5.708336 2.943806 -11.648378 -8.541643 -5.019807 -3.327550 -2.000000e-08
num_private 50490.0 0.459497 10.413455 0.000000 0.000000 0.000000 0.000000 1.402000e+03
region_code 50490.0 15.280511 17.550035 1.000000 5.000000 12.000000 17.000000 9.900000e+01
district_code 50490.0 5.641731 9.668596 0.000000 2.000000 3.000000 5.000000 8.000000e+01
population 50490.0 180.913389 479.854870 0.000000 0.000000 25.000000 215.750000 3.050000e+04
construction_year 50490.0 1301.724520 951.258506 0.000000 0.000000 1986.000000 2004.000000 2.013000e+03
  • unterschiedliche Skalen erfodert Normalisierung (z.B. für KNN)
In [110]:
df_pump_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50490 entries, 0 to 50489
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     50490 non-null  int64  
 1   amount_tsh             50490 non-null  float64
 2   date_recorded          50490 non-null  object 
 3   funder                 47396 non-null  object 
 4   gps_height             50490 non-null  int64  
 5   installer              47380 non-null  object 
 6   longitude              50490 non-null  float64
 7   latitude               50490 non-null  float64
 8   wpt_name               50490 non-null  object 
 9   num_private            50490 non-null  int64  
 10  basin                  50490 non-null  object 
 11  subvillage             50175 non-null  object 
 12  region                 50490 non-null  object 
 13  region_code            50490 non-null  int64  
 14  district_code          50490 non-null  int64  
 15  lga                    50490 non-null  object 
 16  ward                   50490 non-null  object 
 17  population             50490 non-null  int64  
 18  public_meeting         47644 non-null  object 
 19  recorded_by            50490 non-null  object 
 20  scheme_management      47221 non-null  object 
 21  scheme_name            26548 non-null  object 
 22  permit                 47899 non-null  object 
 23  construction_year      50490 non-null  int64  
 24  extraction_type        50490 non-null  object 
 25  extraction_type_group  50490 non-null  object 
 26  extraction_type_class  50490 non-null  object 
 27  management             50490 non-null  object 
 28  management_group       50490 non-null  object 
 29  payment                50490 non-null  object 
 30  payment_type           50490 non-null  object 
 31  water_quality          50490 non-null  object 
 32  quality_group          50490 non-null  object 
 33  quantity               50490 non-null  object 
 34  quantity_group         50490 non-null  object 
 35  source                 50490 non-null  object 
 36  source_type            50490 non-null  object 
 37  source_class           50490 non-null  object 
 38  waterpoint_type        50490 non-null  object 
 39  waterpoint_type_group  50490 non-null  object 
 40  status_group           50490 non-null  object 
dtypes: float64(3), int64(7), object(31)
memory usage: 15.8+ MB
  • Features mit Missing Values
  • viele kategorielle Merkmale
In [111]:
print("Unique Values:")
nuniques = pd.DataFrame({"Feature":[], "nuniques":[]})
for col in df_pump_train.columns:
    if df_pump_train[col].dtype == object:
        nuniques.loc[len(nuniques)] = [col, df_pump_train[col].nunique()]
nuniques.sort_values("nuniques", ascending=False)
Unique Values:
Out[111]:
Feature nuniques
3 wpt_name 32390.0
5 subvillage 17774.0
12 scheme_name 2583.0
8 ward 2085.0
2 installer 1964.0
1 funder 1734.0
0 date_recorded 348.0
7 lga 125.0
6 region 21.0
14 extraction_type 18.0
15 extraction_type_group 13.0
11 scheme_management 12.0
17 management 12.0
25 source 10.0
4 basin 9.0
21 water_quality 8.0
16 extraction_type_class 7.0
19 payment 7.0
20 payment_type 7.0
26 source_type 7.0
28 waterpoint_type 7.0
22 quality_group 6.0
29 waterpoint_type_group 6.0
18 management_group 5.0
23 quantity 5.0
24 quantity_group 5.0
27 source_class 3.0
30 status_group 3.0
9 public_meeting 2.0
13 permit 2.0
10 recorded_by 1.0
  • viele kategorielle Features mit sehr vielen Ausprägungen (Ignorieren, Gruppieren, Target Encoding?)
  • recorded_by kann ignoriert werden, da ein konstanter Wert
Verteilung der Zielvariablen¶
In [112]:
df_pump_train["t_functional"] = df_pump_train["status_group"].apply(lambda x: 0 if x=="non functional" else 1)
df_pump_train["t_needs_rep"] = df_pump_train["status_group"].apply(lambda x: 1 if x=="functional needs repair" else 0)

targets = ["status_group", "t_functional", "t_needs_rep"]

plt.figure(figsize=(12,4))
i=0
for t in targets:
        i+=1
        plt.subplot(1, 3, i)
        df_pump_train[t].value_counts(normalize=True).plot(kind='bar')
        plt.title(t)
plt.tight_layout()
  • Imbalance insbesondere für den 3. Fall
Einfluss numerischer Features - Scattermatrix¶
In [113]:
df_pump_train["id"] = df_pump_train["id"].astype("category")
df_pump_train["region_code"] = df_pump_train["region_code"].astype("category")
df_pump_train["district_code"] = df_pump_train["district_code"].astype("category")
sns.pairplot(pd.concat((df_pump_train.select_dtypes(exclude="object"), df_pump_train["status_group"]), axis=1), hue="status_group")
Out[113]:
<seaborn.axisgrid.PairGrid at 0x19e0036c408>
  • Einfluss von latitude/longitude, gps_height leicht sichtbar
  • Ausreißer bei der latitude/longitude
  • 0-Werte bei construction_year
  • kein auf den ersten Blick ersichtiches "golden feature", nur leichte Einflüsse sichtbar
Korrelationmatrix¶
In [114]:
plt.figure(figsize=(8,6))
sns.heatmap(df_pump_train.corr(), annot=True, mask=np.triu(df_pump_train.corr()), vmin=-1, vmax=1, linewidths=0.25);
  • keine starke Korrelation mit Zielvariable
  • offensichtliche Korrelationen (Longitude, Latitude)
  • Keine neuen Erkenntnisse für die Problemstellung - kategorische Daten sollten betrachtet werden
Einfluss numerischer Features auf die Zielvariablen in Boxplots¶
In [115]:
num_features = list(filter(None, [col if df_pump_train[col].dtype == "int64" or df_pump_train[col].dtype == "float64" else None for col in df_pump_train.columns]))
num_features = list(set(num_features)-set(targets))
i=1
plt.figure(figsize=(12,18))
for f in num_features:
    plt.subplot(len(num_features), 3, i)
    sns.boxplot(x="t_functional", y=f, data=df_pump_train)
    plt.subplot(len(num_features), 3, i+1)
    sns.boxplot(x="t_needs_rep", y=f, data=df_pump_train)
    plt.subplot(len(num_features), 3, i+2)
    sns.boxplot(x="status_group", y=f, data=df_pump_train)
    i+=3
plt.tight_layout()
  • häufig nicht gut erkennbar aufgrund vieler Ausreißer
  • bestätigt Einfluss von longitude, latitude und gps_height
  • construction_year muss erst bereinigt werden

construction_year ohne 0-Werte:

In [116]:
plt.figure(figsize=(12,4.5))
plt.subplot(1, 3, 1)
sns.boxplot(x="t_functional", y="construction_year", data=df_pump_train.loc[df_pump_train["construction_year"]!=0])
plt.subplot(1, 3, 2)
sns.boxplot(x="t_needs_rep", y="construction_year", data=df_pump_train.loc[df_pump_train["construction_year"]!=0])
plt.subplot(1, 3, 3)
sns.boxplot(x="status_group", y="construction_year", data=df_pump_train.loc[df_pump_train["construction_year"]!=0])
plt.xticks(rotation=90)


plt.tight_layout()
  • Construction Year hat Einfluss auf Zielvariable - ältere Brunnen funktionieren tendenziell eher nicht
Einfluss kategorieller und binärer Merkmale¶
In [117]:
qualitative_features = list(filter(None, [col if df_pump_train[col].dtype==object or df_pump_train[col].dtype==bool or df_pump_train[col].dtype=='category' else None for col in df_pump_train.columns]))
qualitative_features = list(set(qualitative_features)-set(targets))

i=1
fig = plt.figure(figsize=(12,64))
for col in qualitative_features:
    if df_pump_train[col].nunique() < 10 and df_pump_train[col].nunique()>1:
        for target in targets:
            ax = fig.add_subplot(15, 3, i)
            ct = pd.crosstab(index=df_pump_train[col],columns=df_pump_train[target])
            ct.plot.bar(ax=ax, rot=0)
            plt.xticks(rotation=90)
            plt.legend(fontsize=8)
            i+=1
plt.tight_layout()
  • einige redundante Merkmale, die sich nur wenig unterscheiden (Bsp. s.u.)
  • other/unknown ist häufig eine wichtige Ausprägung (z.B. bei waterpoint)
  • einige Merkmale zeigen einen (geringen) Einfluss (erkennbar an anderem Verhältnis der Balken für unterschiedliche Merkmalsausprägungen)
In [118]:
df_pump_train[["payment", "payment_type"]].drop_duplicates()
Out[118]:
payment payment_type
0 pay per bucket per bucket
1 unknown unknown
2 never pay never pay
3 pay monthly monthly
19 other other
20 pay annually annually
43 pay when scheme fails on failure
  • Payment und Payment Type enthalten die selbe Information
  • gilt wie oben beschrieben für eine Reihe von Features
Geografische Zusammenhänge¶
In [4]:
import geopandas as gpd
fig, ax = plt.subplots(figsize=(8,6))

countries = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
countries[countries["continent"] == "Africa"].plot(color="lightgrey", ax=ax)
df_pump_train.plot(x="longitude", y="latitude", kind="scatter", c="tab:blue", colormap="Blues", title="Geographic Distribution", ax=ax)
Out[4]:
<AxesSubplot:title={'center':'Geographic Distribution'}, xlabel='longitude', ylabel='latitude'>
  • alle Brunnen sind in Tansania, bis auf Ausreißer bei (0,0)
In [3]:
import plotly.express as px
fig = px.scatter_mapbox(df_pump_train, lat="latitude", lon="longitude", color="status_group", zoom=5, opacity=0.25, hover_name="id")
fig.update_layout(
    mapbox_style="white-bg",
    mapbox_layers=[
        {
            "below": 'traces',
            "sourcetype": "raster",
            "sourceattribution": "United States Geological Survey",
            "source": [
                "https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"
            ]
        }
      ])
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
  • Grafik zeigt geografische Verteilung der Zielvariable
  • Leichter Einfluss ist erkennbar - unreine Clusterbildung
In [4]:
# interesting = ["gps_height", "basin", "region"]
fig = px.scatter_mapbox(df_pump_train, lat="latitude", lon="longitude", color="region", zoom=5, opacity=0.25, hover_name="id")
fig.update_layout(
    mapbox_style="white-bg",
    mapbox_layers=[
        {
            "below": 'traces',
            "sourcetype": "raster",
            "sourceattribution": "United States Geological Survey",
            "source": [
                "https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"
            ]
        }
      ])
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
  • Grafik stellt die Regionen farblich dar
In [5]:
fig = px.scatter_mapbox(df_pump_train, lat="latitude", lon="longitude", color="construction_year", zoom=5, opacity=0.25, hover_name="id")
fig.update_layout(
    mapbox_style="white-bg",
    mapbox_layers=[
        {
            "below": 'traces',
            "sourcetype": "raster",
            "sourceattribution": "United States Geological Survey",
            "source": [
                "https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"
            ]
        }
      ])
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
  • geografische Systematik hinter den 0-Werten in construction_year erkennbar

Data Preparation¶

Die Vorbereitung der Daten wird in einer Pipeline durchgeführt, die schnell und einfach angepasst sowie problemlos auf die Testddaten angewandt werden kann.
Zielsetzung:

  • Nullwerte ersetzen und auffüllen
  • Kategorielle Merkmale encoden
  • Doppelte Merkmale entfernen
  • Merkmale skalieren
  • Feature Engineering
In [5]:
#Prüft ob Name vorhanden, dann 1, sonst 0
class YesNoEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, add_column = True):
        self.add_column = add_column
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        if self.add_column:
            has_name = np.where(X['wpt_name']!= 'none', 1, 0)
            Xcopy = X
            Xcopy['wpt_name'] = has_name
        return Xcopy
In [6]:
#Füllt Nullwerte auf, erkennt auch Werte, die vom normalen Imputer nicht erkannt werden: if str(x)==str(np.nan)
class MyImputer(BaseEstimator, TransformerMixin):
    def __init__(self, fill_value="other"):
        self.fill_value = fill_value
        return None
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return np.array([self.fill_value if str(x)==str(np.nan) else x for x in X]).reshape(-1,1)
In [7]:
scheme_pipeline = Pipeline([
('imputer', MyImputer(fill_value="other")),
('one_hot', OneHotEncoder(handle_unknown='ignore'))
])
In [8]:
scheme_pipeline_lbl = Pipeline([
    ('imputer', MyImputer(fill_value="other")),
    ('ord_enc', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)),
    ('imp2', SimpleImputer(fill_value=0.5)),
    ('min_max', MinMaxScaler())
])
In [9]:
#Reduziert Datumswerte auf reine Jahresinformation
class YearAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        return None
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        date_column = pd.Series(X)
        return np.array(pd.DatetimeIndex(date_column).year).reshape(-1,1)
        
In [10]:
year_pipeline = Pipeline([
    ("add_year", YearAdder()),
    ("scale", MinMaxScaler())
])
In [12]:
age_pipeline = Pipeline([
    ("imp_age", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler())
])
In [13]:
col_imp = Pipeline([
    ('simp_inputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('target_enc', TargetEncoder()),
    ('min_max', MinMaxScaler())
])
In [14]:
target_enc_pipe = Pipeline([
    ('target_enc', TargetEncoder()),
    ('min_max', MinMaxScaler())
])
In [15]:
ord_enc_pipe = Pipeline([
    ('ord_enc', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)),
    ('imp2', SimpleImputer(fill_value=0.5)),
    ('min_max', MinMaxScaler())
])
In [16]:
df_train = df_pump_train.copy()

#Erstellt neue Zielvariable für die unterschiedlichen Fragestellungen
df_train["t_functional"] = df_train["status_group"].apply(lambda x: 0 if x=="non functional" else 1)
df_train["t_needs_rep"] = df_train["status_group"].apply(lambda x: 1 if x=="functional needs repair" else 0)

#Ausreißer entfernen
df_train = df_train.drop(df_train[df_train.longitude == 0].index)

#Manuelles Featuere Engineering
df_train["amount_per_capita"] = df_train.apply(lambda x: x["amount_tsh"]/x["population"] if x["population"] != 0 else 0, axis=1)
df_train["age"]=df_train['date_recorded'].apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d'))-df_train["construction_year"].apply(lambda x: datetime.datetime.strptime(str(x),'%Y') if x!=0 else datetime.datetime.now())
df_train["age"].loc[df_train["construction_year"]!=0]=df_train["age"].loc[df_train["construction_year"]!=0].apply(lambda x: x.days)
df_train["age"].loc[df_train["construction_year"]==0]=np.nan

#Datentypen manuell transformieren
df_train["district_code"] = df_train["district_code"].astype("category")
df_train["region_code"] = df_train["region_code"].astype("category")

#label / target Enconding
y_train = df_train.loc[:,'status_group']
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_train_f = df_train["t_functional"]
y_train_r = df_train["t_needs_rep"]

Verschiedene Transformer, die später getestet werden.

Alle Spalten mit target, binary oder one-hot encoding transformiert, Duplikate werden gedroppt:

In [17]:
col_trans = ColumnTransformer(transformers=[
    ['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
    ['target_enc', target_enc_pipe, ['lga', 'ward', 'management']],
    ['oh_enc', OneHotEncoder(), ['basin', 'payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class']],
    ['bin_enc', BinaryEncoder(), ['region', 'region_code', 'district_code']],
    ['cust_trans', YesNoEncoder(), ['wpt_name']],
    ["num", MinMaxScaler(), ["amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude']],
    ["scheme", scheme_pipeline, "scheme_management"],
    ["year",  year_pipeline, "date_recorded"]
], remainder='drop')

Transformation mit Ordinal Encoder:

In [18]:
col_trans_lbl = ColumnTransformer(transformers=[
    ['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
    ['target_enc', target_enc_pipe, ['lga', 'ward', 'management']],
    ['ord_enc', ord_enc_pipe, ['basin', 'payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class', 'region', 'region_code', 'district_code']],
    ['cust_trans', YesNoEncoder(), ['wpt_name']],
    ["num", MinMaxScaler(), ["amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude']],
    ["scheme", scheme_pipeline_lbl, "scheme_management"],
    ["year",  year_pipeline, "date_recorded"]
], remainder='drop')

Fügt neues Merkmal 'Pumpenalter' hinzu:

In [19]:
col_trans_w_age = ColumnTransformer(transformers=[
    ['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
    ['target_enc', target_enc_pipe, ['lga', 'ward', 'management']],
    ['oh_enc', OneHotEncoder(), ['basin', 'payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class']],
    ['bin_enc', BinaryEncoder(), ['region', 'region_code', 'district_code']], # warum alle 3 ?
    ['cust_trans', YesNoEncoder(), ['wpt_name']],
    ["num", MinMaxScaler(), ["amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude']],
    ["scheme", scheme_pipeline, "scheme_management"],
    ["year",  year_pipeline, "date_recorded"],
    ["age", age_pipeline, ["age"]]
], remainder='drop')

Kombination aus Alter und Ordinal Encoding

In [20]:
col_trans_lbl_w_age = ColumnTransformer(transformers=[
    ['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
    ['target_enc', target_enc_pipe, ['lga', 'ward', 'management']],
    ['ord_enc', ord_enc_pipe, ['basin', 'payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class', 'region', 'region_code', 'district_code']],
    ['cust_trans', YesNoEncoder(), ['wpt_name']],
    ["num", MinMaxScaler(), ["amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude']],
    ["scheme", scheme_pipeline_lbl, "scheme_management"],
    ["year",  year_pipeline, "date_recorded"],
    ["age", age_pipeline, ["age"]]
], remainder='drop')

Erstellung eines reduzierten Featuressets (16 Merkmale) mit Ordinal Encoding:

In [21]:
col_trans_lbl_reduced = ColumnTransformer(transformers=[
    ['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
    ['target_enc', target_enc_pipe, ['lga', 'ward']],
    ['ord_enc', ord_enc_pipe, ['payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class', 'region', 'region_code']],
    ['cust_trans', YesNoEncoder(), ['wpt_name']],
    ["num", MinMaxScaler(), ["amount_tsh"]],
    ["year",  year_pipeline, "date_recorded"]
], remainder='drop')

Fitting der Transformer auf die Trainingsdaten und Anwendung auf die selbigen:

In [22]:
X_train = col_trans.fit_transform(X=df_train, y=y_train)
X_train_lbl = col_trans_lbl.fit_transform(X=df_train, y=y_train)
X_train_w_age = col_trans_w_age.fit_transform(X=df_train, y=y_train)
X_train_lbl_w_age = col_trans_lbl_w_age.fit_transform(X=df_train, y=y_train)
X_train_lbl_reduced = col_trans_lbl_reduced.fit_transform(X=df_train, y=y_train)

Undersampling¶

Für die unterschiedlichen Ausprägungen der Target Variable status_group besteht, wie in der Data Analysis bereits festgestellt, ein Ungleichgewicht. Datenpunkte für die Gruppe 'functional needs repair' machen nur einen kleinen Teil der Gesamtheit aus. Dieses Ungeleichgewicht kann das Training unserer Modelle beeinflussen, da sich diese beim Training hauptsächlich auf die 'Majority Class' konzentrieren.

Möglichkeiten mit Ungleichgewicht umzugehen:

  • Oversampling
  • Undersampling

Wir entscheiden uns für Undersampling, da trotz der Einschränkung der Größe des Datensatzes, noch viele Datenpunkte zur Verfügung stehen.

In [23]:
from imblearn.under_sampling import RandomUnderSampler

#Trennung nach Reparaturstatus
df_train_1 = df_train[df_train["t_needs_rep"] == 1] 
df_train_0 = df_train[df_train["t_needs_rep"] == 0]

num_r = len(df_train_1)
print(f"{num_r} pumps need repairs, therefore we need the same amount of pumps that don't need repairs")


rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(df_train, y_train_r)
print(X_resampled["t_needs_rep"].value_counts())
3356 pumps need repairs, therefore we need the same amount of pumps that don't need repairs
0    3356
1    3356
Name: t_needs_rep, dtype: int64
In [24]:
# neue Instanzen der Transformer, teils mit setzen von handle_unknown

col_trans_u = ColumnTransformer(transformers=[
    ['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
    ['target_enc', target_enc_pipe, ['lga', 'ward', 'management']],
    ['oh_enc', OneHotEncoder(handle_unknown='ignore'), ['basin', 'payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class']],
    ['bin_enc', BinaryEncoder(handle_unknown='ignore'), ['region', 'region_code', 'district_code']], # warum alle 3 ?
    ['cust_trans', YesNoEncoder(), ['wpt_name']],
    ["num", MinMaxScaler(), ["amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude']],
    ["scheme", scheme_pipeline, "scheme_management"],
    ["year",  year_pipeline, "date_recorded"]
], remainder='drop')

col_trans_lbl_w_age_u = ColumnTransformer(transformers=[
    ['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
    ['target_enc', target_enc_pipe, ['lga', 'ward', 'management']],
    ['ord_enc', ord_enc_pipe, ['basin', 'payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class', 'region', 'region_code', 'district_code']],
    ['cust_trans', YesNoEncoder(), ['wpt_name']],
    ["num", MinMaxScaler(), ["amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude']],
    ["scheme", scheme_pipeline_lbl, "scheme_management"],
    ["year",  year_pipeline, "date_recorded"],
    ["age", age_pipeline, ["age"]]
], remainder='drop')

col_trans_lbl_reduced_u = ColumnTransformer(transformers=[
    ['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
    ['target_enc', target_enc_pipe, ['lga', 'ward']],
    ['ord_enc', ord_enc_pipe, ['payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class', 'region', 'region_code']],
    ['cust_trans', YesNoEncoder(), ['wpt_name']],
    ["num", MinMaxScaler(), ["amount_tsh"]],
    ["year",  year_pipeline, "date_recorded"]
], remainder='drop')

Anwendung der ColumnTransformer auf unsere Trainingsdaten:

In [25]:
X_train_u = col_trans_u.fit_transform(X=X_resampled, y=y_resampled)
X_train_u_lbl = col_trans_lbl_w_age_u.fit_transform(X=X_resampled, y=y_resampled)
X_train_u_lbl_reduced = col_trans_lbl_reduced_u.fit_transform(X=X_resampled, y=y_resampled)
y_train_u_r = y_resampled

Prüfung auf Dimensionen nach der Transformation:

In [26]:
print(X_train.shape)
print(X_train_lbl.shape)
print(X_train_u.shape)
print(X_train_u_lbl.shape)
print(X_train_w_age.shape)
print(X_train_lbl_w_age.shape)
print(X_train_lbl_reduced.shape)
(48958, 95)
(48958, 25)
(6712, 92)
(6712, 26)
(48958, 96)
(48958, 26)
(48958, 16)
In [27]:
print(f'({X_train.min()}, {X_train.max()})')
print(f'({X_train_lbl.min()}, {X_train_lbl.max()})')
print(f'({X_train_u.min()}, {X_train.max()})')
print(f'({X_train_u_lbl.min()}, {X_train_lbl.max()})')
print(f'({X_train_w_age.min()}, {X_train_w_age.max()})')
print(f'({X_train_lbl_w_age.min()}, {X_train_lbl_w_age.max()})')
print(f'({X_train_lbl_reduced.min()}, {X_train_lbl_reduced.max()})')
(0.0, 1.0)
(0.0, 1.0)
(0.0, 1.0)
(0.0, 1.0)
(0.0, 1.0)
(0.0, 1.0)
(0.0, 1.0)

Modeling¶

Training von Prognosemodelle für die 3 Aufgaben

1. Prognosemodell: functional vs. non-functional¶

In [147]:
df_results = pd.DataFrame(data=[], columns=["Model", "AUC Train", "AUC CV","Precsion","Recall"])

1a. Model Selection¶

Logistic Regression¶
In [148]:
lr_cf = LogisticRegression(solver="sag", max_iter=1000, n_jobs= -1, class_weight='balanced')
lr_cf.fit(X_train, y_train_f)
y_pred_f = lr_cf.predict_log_proba(X_train)[:,1]

auc_train_f = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=lr_cf, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr', n_jobs= -1)
auc_cv = cv_res["test_score"].mean()

y_pred_f_bin = lr_cf.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)

df_results.loc[len(df_results)] = [lr_cf, auc_train_f, auc_cv, precision, recall]
Decision Tree¶
In [149]:
dt_cf = DecisionTreeClassifier()
dt_cf.fit(X_train, y_train_f)

y_pred_f = dt_cf.predict_proba(X_train)[:,1]

auc_train_f = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=dt_cf, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr', n_jobs= -1)

y_pred_f_bin = dt_cf.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)

df_results.loc[len(df_results)] = [dt_cf, auc_train_f, auc_cv, precision, recall]
Random Forest¶
In [150]:
rf_cf = RandomForestClassifier(random_state=42, n_jobs= -1)
rf_cf.fit(X_train, y_train_f)
y_pred_f = rf_cf.predict_proba(X_train)[:,1]
auc_train_f = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=rf_cf, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr', n_jobs= -1)
auc_cv = cv_res["test_score"].mean()

y_pred_f_bin = rf_cf.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)

df_results.loc[len(df_results)] = ['Random Forest', auc_train_f, auc_cv, precision, recall]
XG Boost¶
In [151]:
xgb_cf = XGBClassifier(eval_metric = 'auc', n_jobs= -1)
xgb_cf.fit(X_train, y_train_f)
y_pred_f = xgb_cf.predict_proba(X_train)[:,1]
auc_train = roc_auc_score(y_train_f, y_pred_f, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()

y_pred_f_bin = xgb_cf.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)

df_results.loc[len(df_results)] = ['xgBoost', auc_train, auc_cv, precision, recall]
K-Nearest-Neighbors¶
In [152]:
knn_cf = KNeighborsClassifier(n_jobs= -1)
knn_cf.fit(X_train, y_train_f)
y_pred_F = knn_cf.predict_proba(X_train)

auc_train = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=knn_cf, X=X_train, y=y_train_f, cv=5, scoring='roc_auc_ovr', n_jobs= -1)
auc_cv = cv_res["test_score"].mean()

y_pred_f_bin = knn_cf.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)

df_results.loc[len(df_results)] = [knn_cf, auc_train, auc_cv, precision, recall]
Support Vector Machine¶
In [153]:
svm_cf = SVC(probability=True,cache_size= 3000)
svm_cf.fit(X_train, y_train_f)
y_pred_f = svm_cf.predict_proba(X_train)[:,1]
print(y_pred_f)
auc_train = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=svm_cf, X=X_train, y=y_train_f, cv=5, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()

y_pred_f_bin = svm_cf.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)

df_results.loc[len(df_results)] = [svm_cf, auc_train, auc_cv, precision, recall]
[0.98465951 0.44259414 0.07015871 ... 0.9767568  0.10482034 0.16657297]
In [154]:
display(df_results)
Model AUC Train AUC CV Precsion Recall
0 LogisticRegression(class_weight='balanced', ma... 0.917535 0.916595 0.876565 0.866474
1 DecisionTreeClassifier() 1.000000 0.916595 1.000000 1.000000
2 Random Forest 1.000000 0.932227 0.999933 1.000000
3 xgBoost 0.967729 0.943200 0.893030 0.956678
4 KNeighborsClassifier(n_jobs=-1) 0.967729 0.897929 0.893092 0.937280
5 SVC(cache_size=3000, probability=True) 0.940938 0.932008 0.865777 0.943701
  • Decision Tree, Random Forest und XGBoost besitzen die höchste Güte
  • Hyperparameter Tuning dieser Modelle

1b. Hyperparameter Tuning¶

Random Forest¶
In [155]:
df_tuned_f = pd.DataFrame(data=[], columns=["Model", "AUC Train", "AUC CV", "Precision", "Recall"])
In [156]:
param_grid = [{
    "n_estimators":[570,590, 600, 700],
    "max_depth": [17, 18, 20, 25],
    "bootstrap": [True, False]
}]

rf_cf = RandomForestClassifier(random_state=42, n_jobs=-1)
rs = RandomizedSearchCV(rf_cf, param_distributions=param_grid, scoring="roc_auc_ovr", cv=10, n_jobs=-1)
rs.fit(X_train, y_train_f)
best_rf = rs.best_params_
best_rf
Out[156]:
{'n_estimators': 700, 'max_depth': 18, 'bootstrap': True}
In [157]:
param_grid = [{
 "n_estimators":[590,595],
    "max_depth": [16, 17],
    "bootstrap": [False]
}]

rf_cf = RandomForestClassifier(random_state=42, n_jobs=-1)
rs = GridSearchCV(rf_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=10, n_jobs=-1)
rs.fit(X_train, y_train_f)
best_rf = rs.best_params_
print(best_rf)
{'bootstrap': False, 'max_depth': 17, 'n_estimators': 590}
In [158]:
rf_cf_t = RandomForestClassifier(random_state=42, 
bootstrap = best_rf.get('bootstrap'), 
max_depth = best_rf.get('max_depth'), 
n_estimators =best_rf.get('n_estimators'), 
n_jobs=-1
)

rf_cf_t.fit(X_train, y_train_f)
y_pred_f = rf_cf_t.predict_proba(X_train)[:,1]

auc_train_f = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=rf_cf_t, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()

y_pred_f_bin = rf_cf_t.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)

df_tuned_f.loc[len(df_tuned_f)] = ["Random Forest Tuned", auc_train_f, auc_cv, precision, recall]
Decision Tree¶
In [159]:
param_grid = [{
    "criterion":["gini", "entropy"],
    "max_depth": [8, 9, 10, 11, 12],
    "min_samples_split": [3,5,6,7, 10],
    "min_samples_leaf": [1,2,3,5]
}]

dt_cf = DecisionTreeClassifier(random_state=42)
rs = RandomizedSearchCV(dt_cf, param_distributions=param_grid, scoring="roc_auc_ovr", cv=10, n_iter = 10, n_jobs=-1)
rs.fit(X_train, y_train_f)
best_dt = rs.best_params_
best_dt
Out[159]:
{'min_samples_split': 6,
 'min_samples_leaf': 5,
 'max_depth': 10,
 'criterion': 'gini'}
In [160]:
param_grid = [{
    "criterion":["gini", "entropy"],
    "max_depth": [10, 11, 12],
    "min_samples_split": [6,7,8],
    "min_samples_leaf": [2,3,4]
}]

dt_cf = DecisionTreeClassifier(random_state=42)
rs = GridSearchCV(dt_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=10, n_jobs=-1)
rs.fit(X_train, y_train_f)
best_dt = rs.best_params_
print(best_dt)
{'criterion': 'entropy', 'max_depth': 11, 'min_samples_leaf': 4, 'min_samples_split': 6}
In [161]:
dt_cf_t = DecisionTreeClassifier(
    random_state=42, 
    criterion = best_dt.get('criterion'),
    max_depth = best_dt.get('max_depth'),
    min_samples_split=best_dt.get("min_samples_split"),
    min_samples_leaf=best_dt.get("min_samples_leaf")
    )
dt_cf_t.fit(X_train, y_train_f)
y_pred_f = dt_cf_t.predict_proba(X_train)[:,1]

auc_train_f = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=dt_cf_t, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()

y_pred_f_bin = dt_cf_t.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)

df_tuned_f.loc[len(df_tuned_f)] = ["Decision Tree Tuned", auc_train_f, auc_cv, precision, recall]
XGBoost¶
In [162]:
param_grid = [{
    "eta":[0.1, 0.15, 0.2, 0.5],
    "max_depth": [9,10,12,15],
    "min_child_weight": [0.25, 0,5, 0.75, 1]
}]

xg_cf_t= XGBClassifier(random_state=42, n_jobs= -1, eval_metric = 'auc')
rs = RandomizedSearchCV(xg_cf_t, param_distributions=param_grid, scoring="roc_auc_ovr", cv=10, n_jobs= -1, n_iter = 10)
rs.fit(X_train, y_train_f)
best_xg = rs.best_params_
print(best_xg)
{'min_child_weight': 5, 'max_depth': 12, 'eta': 0.2}
In [163]:
param_grid = [{
   "eta":[0.2, 0.215],
    "max_depth": [9,10],
    "min_child_weight": [1, 1.25]
}]

xg_cf =XGBClassifier(random_state=42)
rs = GridSearchCV(xg_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=10, n_jobs=-1)
rs.fit(X_train, y_train_f)
best_dt = rs.best_params_
print(best_dt)
{'eta': 0.2, 'max_depth': 9, 'min_child_weight': 1.25}
In [164]:
xgb_cf_t = XGBClassifier(nthread=-1,eta=best_xg.get('eta'), max_depth= best_xg.get('max_depth'), min_child_weight= best_xg.get('min_child_weight'), eval_metric = 'auc', n_jobs= -1)
xgb_cf_t.fit(X_train, y_train_f)
y_pred_f = xgb_cf_t.predict_proba(X_train)[:,1]
auc_train = roc_auc_score(y_train_f, y_pred_f, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf_t, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr',  n_jobs= -1)
auc_cv = cv_res["test_score"].mean()

y_pred_f_bin = xgb_cf_t.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)

df_tuned_f.loc[len(df_tuned_f)] = [xgb_cf_t, auc_train, auc_cv, precision, recall]
In [165]:
display(df_results)
display(df_tuned_f)
Model AUC Train AUC CV Precsion Recall
0 LogisticRegression(class_weight='balanced', ma... 0.917535 0.916595 0.876565 0.866474
1 DecisionTreeClassifier() 1.000000 0.916595 1.000000 1.000000
2 Random Forest 1.000000 0.932227 0.999933 1.000000
3 xgBoost 0.967729 0.943200 0.893030 0.956678
4 KNeighborsClassifier(n_jobs=-1) 0.967729 0.897929 0.893092 0.937280
5 SVC(cache_size=3000, probability=True) 0.940938 0.932008 0.865777 0.943701
Model AUC Train AUC CV Precision Recall
0 Random Forest Tuned 0.992042 0.939364 0.926745 0.992580
1 Decision Tree Tuned 0.928126 0.910235 0.860137 0.909563
2 XGBClassifier(base_score=0.5, booster='gbtree'... 0.985583 0.943872 0.925581 0.971684
  • Verbesserte AUC für alle getunten Modelle
  • Evaluation wird zeigen, ob Overfitting vorliegt

2. Prognosemodell: needs repair yes/no¶

2a. Model Selection¶

Für das binäre Klassifizierungsproblem kommen grundsätzlich unterschiedliche Modelle in Frage. Durch Training, Vergleich und Tuning soll das beste Vorgehen zur Prognose von status_group = functional needs repair bestimmt werden.

Im ersten Schritt erfolgt das Training unterschiedlicher Modelle mit der Standard-Konfiguration, um eine Vorauswahl zu treffen.

In [28]:
df_results_r = pd.DataFrame(data=[], columns=["Model", "AUC Train", "AUC CV", "ACC CV"])

Es soll die Area Under the Curve (AUC) als Qualitätsmaß genutzt werden. Die Accuray ist jedoch auch von Interesse.

In [29]:
scoring = {"AUC": "roc_auc", "Accuracy": make_scorer(accuracy_score)}
LogisticRegression¶
In [30]:
lr_cf = LogisticRegression(random_state=42)
lr_cf.fit(X_train_u, y_train_u_r)
y_pred = lr_cf.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=lr_cf, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()

df_results_r.loc[len(df_results_r)] = [lr_cf, auc_train, auc_cv, acc_cv]
SVM¶
In [31]:
svm_cf = SVC(probability=True, random_state=42)
svm_cf.fit(X_train_u, y_train_u_r)
y_pred = svm_cf.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=svm_cf, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()

df_results_r.loc[len(df_results_r)] = [svm_cf, auc_train, auc_cv, acc_cv]
KNN¶
In [32]:
knn_cf = KNeighborsClassifier()
knn_cf.fit(X_train_u, y_train_u_r)
y_pred = knn_cf.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=knn_cf, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()

df_results_r.loc[len(df_results_r)] = [knn_cf, auc_train, auc_cv, acc_cv]
Decision Tree¶
In [33]:
dt_cf_ = DecisionTreeClassifier(random_state=42)
dt_cf_.fit(X_train_u, y_train_u_r)
y_pred = dt_cf_.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=dt_cf_, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()

df_results_r.loc[len(df_results_r)] = [dt_cf_, auc_train, auc_cv, acc_cv]
Random Forest¶
In [34]:
rf_cf = RandomForestClassifier(random_state=42)
rf_cf.fit(X_train_u, y_train_u_r)
y_pred = rf_cf.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=rf_cf, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()

df_results_r.loc[len(df_results_r)] = [str(rf_cf), auc_train, auc_cv, acc_cv]
XGBoost¶
In [ ]:
xgb_cf = XGBClassifier()
xgb_cf.fit(X_train_u, y_train_u_r)
y_pred = xgb_cf.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=xgb_cf, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()

df_results_r.loc[len(df_results_r)] = [xgb_cf, auc_train, auc_cv, acc_cv]
Model Score Comparison¶

Mit 5-facher Kreuzvalidierung wurden Scores zur Bewertung der unterschiedlichen Modelle berechnet.

  • die Ensemble-Modelle (XGBoost und RandomForest) liefern gute Werte für das Gütemaß AUC
  • bei den Baum-basierten Verfahren kann Overfitting festgestellt werden
  • auch Logistische Regression und die Support Vector Machine scheinen für die Klassifizierung anwendbar zu sein
  • der einfache Entscheidungsbaum sowie KNN liefern in der Standard-Konfiguration im Vergleich keinen guten Werte

Fazit Modellauswahl: Potenzielle Verbesserungen bei Ensemble-Modellen, Logistischer Regression sowie SVM durch Tuning erforschen. Die Ansätze Entscheidungsbaum und KNN werden nicht weiter verfolgt.

In [37]:
display(df_results_r)
Model AUC Train AUC CV ACC CV
0 LogisticRegression(random_state=42) 0.923421 0.917910 0.824793
1 SVC(probability=True, random_state=42) 0.937588 0.915691 0.826281
2 KNeighborsClassifier() 0.933883 0.843600 0.770112
3 DecisionTreeClassifier(random_state=42) 1.000000 0.771158 0.771158
4 RandomForestClassifier(random_state=42) 1.000000 0.894417 0.810490
5 XGBClassifier(base_score=0.5, booster='gbtree'... 0.990137 0.914013 0.812576

2b. Hyperparameter Tuning¶

Gewählte Modelle:

  • RandomForest
  • LogisticRegression
  • SupportVectorMachine
  • XGBoost
In [174]:
df_tuned_r = pd.DataFrame(data=[], columns=["Model", "AUC Train", "AUC CV", "ACC CV"])
Random Forest¶

Initiales Tuning mit RandomSearch, um den GridSearch Bereich einzuschränken.

In [175]:
model_params = {
    'n_estimators' : np.linspace(100, 1000, 5, dtype=int),
    'max_depth': np.linspace(5, 20, 4, dtype=int),
    'min_samples_split' : np.linspace(2, 30, 5, dtype=int),
    'min_samples_leaf' : np.linspace(1, 5, 3, dtype=int),
    'bootstrap' : [True, False]
}
rf_cf_tuned = RandomForestClassifier(random_state=42)
rs = RandomizedSearchCV(rf_cf_tuned, param_distributions=model_params, scoring="roc_auc", cv=5, n_jobs=-1)
rs.fit(X_train_u, y_train_u_r)
rs.best_params_
Out[175]:
{'n_estimators': 550,
 'min_samples_split': 30,
 'min_samples_leaf': 5,
 'max_depth': 20,
 'bootstrap': False}

GridSearch im Bereich der mit RandomSearch bestimmten optimalen Parameter.

In [176]:
param_grid = [{
    'n_estimators' : [250, 500, 750],
    'max_depth': [10, 15, 20],
    'min_samples_split' : [20, 25, 30],
    'min_samples_leaf' : [1, 2],
    'bootstrap' : [False]
}]
rf_cf_tuned = RandomForestClassifier(random_state=42)
gs = GridSearchCV(rf_cf_tuned, param_grid=param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
gs.fit(X_train_u, y_train_u_r)
gs.best_params_
Out[176]:
{'bootstrap': False,
 'max_depth': 15,
 'min_samples_leaf': 1,
 'min_samples_split': 30,
 'n_estimators': 750}

Training eines Random Forests basierend auf den gefundenen optimalen Parametern.

In [177]:
rf_cf_tuned = gs.best_estimator_
rf_cf_tuned.fit(X_train_u, y_train_u_r)
y_pred = rf_cf_tuned.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=rf_cf_tuned, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()

df_tuned_r.loc[len(df_tuned_r)] = [str(rf_cf_tuned), auc_train, auc_cv, acc_cv]

Analoges Vorgehen für das Tuning der anderen Klassifizierer.

Logistic Regression¶
In [178]:
model_params = {
    'penalty' : ['l1', 'l2'],
    'C': np.linspace(0.1, 2, 5, dtype=float),
    'solver' : ['liblinear', 'saga', 'lbfgs']
}
lr_cf_tuned = LogisticRegression(random_state=42)
rs = RandomizedSearchCV(lr_cf_tuned, param_distributions=model_params, scoring="roc_auc", cv=5, n_jobs=-1)
rs.fit(X_train_u, y_train_u_r)
rs.best_params_
Out[178]:
{'solver': 'liblinear', 'penalty': 'l1', 'C': 1.05}
In [179]:
param_grid = [{
    'penalty' : ['l1', 'l2'],
    'C': [0.5, 0.6],
    'solver': ['lbfgs', 'saga']
}]

lr_cf_tuned = LogisticRegression(random_state=42)
gs = GridSearchCV(lr_cf_tuned, param_grid=param_grid, scoring="roc_auc", cv=5, verbose=1, n_jobs=-1)
gs.fit(X_train_u, y_train_u_r)
gs.best_params_
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Out[179]:
{'C': 0.5, 'penalty': 'l1', 'solver': 'saga'}
In [180]:
lr_cf_tuned = gs.best_estimator_

lr_cf_tuned.fit(X_train_u, y_train_u_r)
y_pred = lr_cf_tuned.predict_proba(X_train_u)[:, 1]

auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=lr_cf_tuned, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()

df_tuned_r.loc[len(df_tuned_r)] = [lr_cf, auc_train, auc_cv, acc_cv]
SVM¶
In [181]:
param_grid = [{
    'C': [0.1, 1, 10],
    'gamma': ['auto', 'scale']
}]

svm_cf_tuned = SVC(probability=True, random_state=42, kernel="rbf")
gs = GridSearchCV(svm_cf_tuned, param_grid=param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
gs.fit(X_train_u, y_train_u_r)
gs.best_params_
Out[181]:
{'C': 10, 'gamma': 'auto'}
In [182]:
svm_cf_tuned = gs.best_estimator_

svm_cf_tuned.fit(X_train_u, y_train_u_r)
y_pred = svm_cf_tuned.predict_proba(X_train_u)[:, 1]

auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=svm_cf_tuned, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring, n_jobs=-1)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()

df_tuned_r.loc[len(df_tuned_r)] = [svm_cf_tuned, auc_train, auc_cv, acc_cv]
XGBoost¶
In [183]:
model_params = {
    'n_estimators' : np.linspace(100, 400, 4, dtype=int),
    'max_depth': np.linspace(5, 20, 4, dtype=int),
    'learning_rate' : np.linspace(0.1, 1, 4, dtype=int)
}
xgb_cf_tuned = XGBClassifier(use_label_encoder=False)
rs = RandomizedSearchCV(xgb_cf_tuned, param_distributions=model_params, scoring="roc_auc", cv=5, n_jobs=-1)
rs.fit(X_train_u, y_train_u_r)
rs.best_params_
Out[183]:
{'n_estimators': 200, 'max_depth': 10, 'learning_rate': 1}
In [184]:
param_grid = [{
    "n_estimators":[100, 200, 300],
    "max_depth": [5, 10],
    "learning_rate": [0.1, 0.5]
}]

xgb_cf_tuned = XGBClassifier(use_label_encoder=False)
gs = GridSearchCV(xgb_cf_tuned, param_grid=param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
gs.fit(X_train_u, y_train_u_r)
gs.best_params_
Out[184]:
{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
In [185]:
xgb_cf = rs.best_estimator_

xgb_cf_tuned.fit(X_train_u, y_train_u_r)
y_pred = xgb_cf_tuned.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=xgb_cf_tuned, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()

df_tuned_r.loc[len(df_tuned_r)] = [xgb_cf_tuned, auc_train, auc_cv, acc_cv]
Model Score Comparison¶

Mit 5-facher Kreuzvalidierung wurden Scores zur Bewertung der unterschiedlichen optimierten Modelle berechnet.

  • alle Modelle liegen nun bei den Scores sehr Nahe beeinander
  • für XGBoost kann immer noch Overfitting unterstellt werden

Fazit Tuning: Evaluation auf den Testdaten muss zeigen, welches Prognosemodell den Status functional needs repair am zuverlässigsten vorraussagen kann. Ein weitere Einflussfaktor für die Modellauswahl sind jedoch auch die "Trainingskosten" (in diesem Kontext: Zeit) der Modelle, hier schneidet die SVM im Vergleich schlechter ab als die anderen Optionen.

In [186]:
display(df_tuned_r)
Model AUC Train AUC CV ACC CV
0 RandomForestClassifier(bootstrap=False, max_de... 0.967892 0.911201 0.825239
1 LogisticRegression(random_state=42) 0.922846 0.918654 0.826580
2 SVC(C=10, gamma='auto', probability=True, rand... 0.931920 0.918766 0.831198
3 XGBClassifier(base_score=0.5, booster='gbtree'... 0.990137 0.914013 0.812576

2c. Andere Input Features¶

Random Forest Ordinal Econding¶

Ordinal Encoding erzeugt wesentlich weniger Features als OneHot Encoding, suggeriert jedoch auch eine Reihung der kategorischen Feature-Ausprägungen, was nicht immer im Sinne der Daten ist.

Wirkt sich Ordinal Encoding positiv auf die Performance des Modells aus? Ein Experiment mit Random Forest.

In [188]:
param_grid = [{
    'n_estimators' : [50, 100, 150],
    'max_depth': [5, 10, 15],
    'min_samples_split' : [20, 25, 30],
    'min_samples_leaf' : [1, 2],
    'bootstrap' : [False]
}]
rf_cf_lbl = RandomForestClassifier(random_state=42)
gs = GridSearchCV(rf_cf_lbl, param_grid=param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
gs.fit(X_train_u_lbl, y_train_u_r)
gs.best_params_
Out[188]:
{'bootstrap': False,
 'max_depth': 10,
 'min_samples_leaf': 1,
 'min_samples_split': 25,
 'n_estimators': 150}
In [189]:
rf_cf_lbl = gs.best_estimator_
rf_cf_lbl.fit(X_train_u_lbl, y_train_u_r)
y_pred = rf_cf_lbl.predict_proba(X_train_u_lbl)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=rf_cf_lbl, X=X_train_u_lbl, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()

print(f"{str(rf_cf_lbl)} \nAUC Train: {auc_train}\nAUC CV: {auc_cv}\nACC CV: {acc_cv}")
RandomForestClassifier(bootstrap=False, max_depth=10, min_samples_split=25,
                       n_estimators=150, random_state=42) 
AUC Train: 0.9537723338272335
AUC CV: 0.9183492688465998
ACC CV: 0.827623389147015

Fazit Ordinal Encoding: Das Modell unterscheidet sich nur unwesentlich von dem zuvor trainierten Random Forest. Es ist keine wesentliche Verbesserung durch die Änderung des Encodings der kategoriellen Features festzustellen.

2d. Reduced Input Features¶

Aufgrund der vielen Informationen, welche zu den Brunnen verfügbar sind, ist anzunehmen, dass nicht alle Features für die Klassifizierung relevant sind. Ein Experiment mit einem Random Forest soll überprüfen, ob auch nur mit einer bewussten Auswahl von Features gute Prognosen erzielt werden können.

In [190]:
rf_cf_red = RandomForestClassifier(random_state=42)
rf_cf_red.fit(X_train_u_lbl_reduced, y_train_u_r)
y_pred = rf_cf_red.predict_proba(X_train_u_lbl_reduced)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=rf_cf_red, X=X_train_u_lbl, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()

print(f"{str(rf_cf_red)} \nAUC Train: {auc_train}\nAUC CV: {auc_cv}\nACC CV: {acc_cv}")
RandomForestClassifier(random_state=42) 
AUC Train: 0.994516385716579
AUC CV: 0.8994907783381831
ACC CV: 0.8095965945849374

Fazit Reduced Features: Das entsprechende Modell performt merklich schlechter als zuvor erzielte Ergebnisse. Durch die ausgeschlossenen Features lies sich zwar die Dimension des Problems verkleinern, jedoch scheinen zu viele Informationen verloren zu gehen.

3. Prognosemodell: functional vs. non-functional vs. functional_needs_repair¶

3a. Model Selection¶

In [193]:
df_results = pd.DataFrame(data=[], columns=["Model", "AUC Train", "AUC CV"])
LogisticRegression¶
In [194]:
lr_cf = LogisticRegression(multi_class="ovr", max_iter=1000)
lr_cf.fit(X_train, y_train)
y_pred = lr_cf.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=lr_cf, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_results.loc[len(df_results)] = [lr_cf, auc_train, auc_cv]
SVM¶
In [195]:
svm_cf = SVC(probability=True, decision_function_shape="ovr")
svm_cf.fit(X_train, y_train)
y_pred = svm_cf.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=svm_cf, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()
df_results.loc[len(df_results)] = [svm_cf, auc_train, auc_cv]
KNN¶
In [196]:
knn_cf = KNeighborsClassifier()
knn_cf.fit(X_train, y_train)
y_pred = knn_cf.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=knn_cf, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_results.loc[len(df_results)] = [knn_cf, auc_train, auc_cv]
Decision Tree¶
In [197]:
dt_cf = DecisionTreeClassifier()
dt_cf.fit(X_train, y_train)
y_pred = dt_cf.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=dt_cf, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_results.loc[len(df_results)] = [dt_cf, auc_train, auc_cv]
Random Forest¶
In [198]:
rf_cf = RandomForestClassifier(random_state=42)
rf_cf.fit(X_train, y_train)
y_pred = rf_cf.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_results.loc[len(df_results)] = [str(rf_cf), auc_train, auc_cv]
XGBoost¶
In [199]:
xgb_cf = XGBClassifier()
xgb_cf.fit(X_train, y_train)
y_pred = xgb_cf.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_results.loc[len(df_results)] = [xgb_cf, auc_train, auc_cv]
In [200]:
df_results
Out[200]:
Model AUC Train AUC CV
0 LogisticRegression(max_iter=1000, multi_class=... 0.869001 0.866267
1 SVC(probability=True) 0.912181 0.893104
2 KNeighborsClassifier() 0.957829 0.855479
3 DecisionTreeClassifier() 1.000000 0.751493
4 RandomForestClassifier(random_state=42) 1.000000 0.902239
5 XGBClassifier(base_score=0.5, booster='gbtree'... 0.964362 0.925640

Am besten performen die Ensemble-Modelle Random Forest und XGBoost. Der Decision Tree overfittet stark, die SVM wird nachfolgend aufgrund der Laufzeit ignoriert. Random Forest und XGBoost overfitten auch etwas, Verbesserungen durch Hyperparameter Tuning sind wahrscheinlich.

3b. Hyperparameter Tuning¶

  • Random Forest
  • XGBoost
In [201]:
df_tuned = pd.DataFrame(data=[], columns=["Model", "AUC Train", "AUC CV"])
Random Forest¶
In [202]:
rf_cf = RandomForestClassifier(random_state=42)

model_params = {
    'n_estimators': np.linspace(10,500,50,dtype=int),
    'max_depth': np.linspace(2,30,15,dtype=int),
}

rs = RandomizedSearchCV(rf_cf, model_params, n_iter=20, scoring="roc_auc_ovr", cv=5, n_jobs=-1, random_state=42)
rs.fit(X_train, y_train)
rs.best_params_
Out[202]:
{'n_estimators': 130, 'max_depth': 16}
In [203]:
param_grid = [{
    "n_estimators":[300, 350, 400, 450, 500, 550],
    "max_depth": [16, 18, 20],
}]

rf_cf_opt = RandomForestClassifier(random_state=42)
gs = GridSearchCV(rf_cf_opt, param_grid=param_grid, scoring="roc_auc_ovr", cv=5, n_jobs=-1)
gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)
0.916088281138068
{'max_depth': 18, 'n_estimators': 550}
In [204]:
display(pd.DataFrame(gs.cv_results_)[["param_max_depth", "param_n_estimators", "mean_test_score", "rank_test_score"]])
param_max_depth param_n_estimators mean_test_score rank_test_score
0 16 300 0.915558 12
1 16 350 0.915649 11
2 16 400 0.915800 9
3 16 450 0.915867 7
4 16 500 0.915923 4
5 16 550 0.915915 6
6 18 300 0.915744 10
7 18 350 0.915837 8
8 18 400 0.915943 3
9 18 450 0.915918 5
10 18 500 0.915974 2
11 18 550 0.916088 1
12 20 300 0.914980 18
13 20 350 0.915065 17
14 20 400 0.915187 14
15 20 450 0.915175 15
16 20 500 0.915146 16
17 20 550 0.915264 13

Finetuning bringt nur geringe Verbesserungen

In [394]:
rf_cf_opt = gs.best_estimator_
# rf_cf_opt = RandomForestClassifier(random_state=42, max_depth=18, n_estimators=550)

rf_cf_opt.fit(X_train, y_train)
y_pred = rf_cf_opt.predict_proba(X_train)

auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf_opt, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()

df_tuned.loc[len(df_tuned)] = [f"Random Forest - optimized - {gs.best_params_}", auc_train, auc_cv]
XGBoost¶
In [206]:
xgb_cf = XGBClassifier()

model_params = {
    'n_estimators': np.linspace(10,400,50,dtype=int),
    'max_depth': np.linspace(2,30,15,dtype=int),
    'learning_rate': np.linspace(0.1,1,10)
}

rs = RandomizedSearchCV(xgb_cf, model_params, n_iter=15, scoring="roc_auc_ovr", cv=5, n_jobs=-1, random_state=42)
rs.fit(X_train, y_train)
rs.best_params_
Out[206]:
{'n_estimators': 344, 'max_depth': 4, 'learning_rate': 0.5}
In [207]:
param_grid = [{
    "n_estimators":[250, 300, 350],
    "max_depth": [4, 6, 8, 10],
    "learning_rate": [0.1, 0.5, 1]
}]

xgb_cf = XGBClassifier()
gs = GridSearchCV(xgb_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=5, n_jobs=-1)
gs.fit(X_train, y_train)
gs.best_params_
Out[207]:
{'learning_rate': 0.1, 'max_depth': 8, 'n_estimators': 250}
In [208]:
xgb_cf_opt = gs.best_estimator_

xgb_cf_opt.fit(X_train, y_train)
y_pred = xgb_cf_opt.predict_proba(X_train)

auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf_opt, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()

df_tuned.loc[len(df_tuned)] = [f"XGBoost - optimized - {gs.best_params_}", auc_train, auc_cv]
In [209]:
df_tuned
Out[209]:
Model AUC Train AUC CV
0 Random Forest - optimized - {'max_depth': 18, ... 0.992420 0.916088
1 XGBoost - optimized - {'learning_rate': 0.1, '... 0.978028 0.926525

Beide optimierten Modelle stellen eine Verbesserung gegenüber ihren Default-Varianten dar.

3c. Ansatz: Dimensionsreduktion¶

Aktuell wird mit sehr vielen Features trainiert (encodiert: 96 Spalten). Viele davon sind vorrangig mit 0en oder 1en befüllt (One-Hot und Binary Encoding). Der Feature-Raum kann also durch Dimensionsreduktion höchstwahrscheinlich noch stark komprimiert werden (zur Vermeidung des Fluchs der Dimensionalität, schnellerem Training, ...)


Lösung: PCA

In [210]:
pca_pl = PCA(n_components=len(X_train.T))
pca_pl.fit(X_train)

plt.plot(np.arange(pca_pl.n_components_) + 1, np.cumsum(pca_pl.explained_variance_ratio_), linewidth=1, color='red')
plt.title('Elbow Plot')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.hlines(0.95, -1, 100, linestyle="--", linewidth=1, color="grey")
plt.vlines(40, 0, 1.1, linestyle="--", linewidth=1, color="grey")
plt.xlim(0,96)
plt.ylim(0,1.1)
plt.show()

Kein klarer "elbow" erkennbar, Beibehaltung von 95% der Varianz scheint sinnvoll, d.h. ca. 40 Komponenten

In [211]:
pca = PCA(n_components=40)
pca.fit(X_train)
X_train_ = pca.transform(X_train)
In [212]:
rf_cf = RandomForestClassifier(random_state=42)
rf_cf.fit(X_train_, y_train)
y_pred = rf_cf.predict_proba(X_train_)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf, X=X_train_, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
print(f"{auc_train}, {auc_cv}")
1.0, 0.8747497544670162
In [213]:
xgb_cf = XGBClassifier()
xgb_cf.fit(X_train_, y_train)
y_pred = xgb_cf.predict_proba(X_train_)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf, X=X_train_, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
print(f"{auc_train}, {auc_cv}")
0.9699099128236064, 0.90438109044224

Die Default-Modelle performen nun schlechter als zuvor (vielleicht wurden doch wichtige Informationen entfernt?), der Ansatz mit PCA wird deshalb nachfolgend nicht mehr weiter verfolgt.

3d. Imbalance - Ansatz: class weights¶

Wie in der EDA sichtbar wurde, ist auch beim multinomialen Klassifikationsproblem eine Imbalance zwischen den 3 Klassen vorhanden. Eine Möglichkeit hiermit umzugehen, wurde oben bereits gezeigt (Undersampling). Hier wird nun eine andere Methode verwendet: class bzw. sample weights. Es werden Gewichte für jede Klasse berechnet, die in die Loss-Funktion beim Training eingehen. Für die kleinere Klasse werden höhere Gewichte gewählt, so dass deren Wichtigkeit zunimmt.

XGBoost¶

In [214]:
classes_weights = list(class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train))

weights = np.ones(y_train.shape[0], dtype='float')

for i, val in enumerate(y_train):
    weights[i] = classes_weights[val-1]

xgb_cf_w = XGBClassifier()
xgb_cf_w.fit(X_train, y_train, sample_weight=weights);
In [215]:
y_pred = xgb_cf_w.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf_w, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
print(f"{auc_train}, {auc_cv}")
0.9594796427886951, 0.9256401412787248
In [216]:
param_grid = [{
    "n_estimators":[100, 150, 200, 250, 300],
    "max_depth": [4, 6, 8, 10],
    "learning_rate": [0.1, 0.5, 1]
}]

xgb_cf = XGBClassifier()
gs = GridSearchCV(xgb_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=5, n_jobs=-1)
gs.fit(X_train, y_train, sample_weight=weights)#fit_params={'sample_weight': weights})

xgb_cf_w_opt = gs.best_estimator_

xgb_cf_w_opt.fit(X_train, y_train, sample_weight=weights)
y_pred = xgb_cf_w_opt.predict_proba(X_train)

auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf_w_opt, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()

df_tuned.loc[len(df_tuned)] = [f"XGBoost - weights -optimized {gs.best_params_}", auc_train, auc_cv]

Random Forest¶

In [217]:
rf_cf_w = RandomForestClassifier(random_state=42)
rf_cf_w.fit(X_train, y_train, sample_weight=weights)
y_pred = rf_cf_w.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf_w, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
print(f"{auc_train}, {auc_cv}")
0.9999999997066454, 0.9022388708551381
In [218]:
param_grid = [{
    "n_estimators":[300, 350, 400, 450, 500, 550],
    "max_depth": [16, 18, 20],
}]

rf_cf_w_opt = RandomForestClassifier(random_state=42)
gs = GridSearchCV(rf_cf_w_opt, param_grid=param_grid, scoring="roc_auc_ovr", cv=5, n_jobs=-1)
gs.fit(X_train, y_train, sample_weight=weights)#fit_params={'sample_weight': weights})

rf_cf_w_opt = gs.best_estimator_

rf_cf_w_opt.fit(X_train, y_train, sample_weight=weights)
y_pred = rf_cf_w_opt.predict_proba(X_train)

auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf_w_opt, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()

df_tuned.loc[len(df_tuned)] = [f"Random Forest - weights -optimized {gs.best_params_}", auc_train, auc_cv]

In beiden Fällen wird die AUC nicht signifikant besser. Durch Optimierung mit roc_auc_ovr als Metrik wird bereits zuvor eine gute Vorhersage auch für kleinere Klassen in gewisser Weise erzwungen.

3e. Andere Input-Features¶

Das Konstruktionsjahr wurde bislang ignoriert. In den nachfolgenden Fällen geht dieses Feature als Alter des Brunnens (konstruiiert aus date_recorded - construction_year) in das Modell mit ein.

Random Forest¶
In [220]:
rf_cf = RandomForestClassifier(random_state=42)
rf_cf.fit(X_train_w_age , y_train)
y_pred = rf_cf.predict_proba(X_train_w_age)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf, X=X_train_w_age, y=y_train, cv=5, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()
print(f'{auc_train}, {auc_cv}')
0.9999999943007573, 0.903358859215191

Kleine Verbesserung, deshalb Hyperparamter Optimierung für dieses Modell.

In [221]:
param_grid = [{
    "n_estimators":[400, 450, 500, 550],
    "max_depth": [16, 18, 20],
}]

rf_cf = RandomForestClassifier(random_state=42)
gs = GridSearchCV(rf_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=5, n_jobs=-1)
gs.fit(X_train_w_age, y_train)
Out[221]:
GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42), n_jobs=-1,
             param_grid=[{'max_depth': [16, 18, 20],
                          'n_estimators': [400, 450, 500, 550]}],
             scoring='roc_auc_ovr')
In [222]:
rf_cf_opt_age = gs.best_estimator_

rf_cf_opt_age.fit(X_train_w_age, y_train)
y_pred = rf_cf_opt_age.predict_proba(X_train_w_age)

auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf_opt_age, X=X_train_w_age, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()

df_tuned.loc[len(df_tuned)] = [f"Random Forest - with Age - optimized - {gs.best_params_}", auc_train, auc_cv]
In [223]:
xgb_cf = XGBClassifier()
xgb_cf.fit(X_train_w_age, y_train)
y_pred = xgb_cf.predict_proba(X_train_w_age)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf, X=X_train_w_age, y=y_train, cv=5, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()
print(f'{auc_train}, {auc_cv}')
0.9665996207049984, 0.9261132067285827
In [224]:
param_grid = [{
    "n_estimators":[250, 300, 350],
    "max_depth": [4, 6, 8, 10],
    "learning_rate": [0.1, 0.5, 1]
}]

xgb_cf = XGBClassifier()
gs = GridSearchCV(xgb_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=5, n_jobs=-1)
gs.fit(X_train_w_age, y_train)
gs.best_params_
Out[224]:
{'learning_rate': 0.1, 'max_depth': 8, 'n_estimators': 300}
In [390]:
xgb_cf_opt_age = gs.best_estimator_

xgb_cf_opt_age.fit(X_train_w_age, y_train)
y_pred = xgb_cf_opt_age.predict_proba(X_train_w_age)

auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf_opt_age, X=X_train_w_age, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()

df_tuned.loc[len(df_tuned)] = [f"XGBoost - with Age - optimized {gs.best_params_}", auc_train, auc_cv]
In [232]:
df_tuned
Out[232]:
Model AUC Train AUC CV
0 Random Forest - optimized - {'max_depth': 18, ... 0.992420 0.916088
1 XGBoost - optimized - {'learning_rate': 0.1, '... 0.978028 0.926525
2 XGBoost - weights -optimized {'learning_rate':... 0.978048 0.926460
3 Random Forest - weights -optimized {'max_depth... 0.993145 0.915264
4 Random Forest - with Age - optimized - {'max_d... 0.992760 0.916654
5 XGBoost - with Age - optimized {'learning_rate... 0.983787 0.927360

Evaluation¶

Out-of-Sample Evaluation der besten Modelle mit einem separat preparierten Test-Set.

  • Berechnung des AUC-Scores auf dem Test-Set
  • Visualisierung der ROC-Kurven und Confusion-Matrix
In [258]:
df_pump_test.head()
Out[258]:
id amount_tsh date_recorded funder gps_height installer longitude latitude wpt_name num_private ... water_quality quality_group quantity quantity_group source source_type source_class waterpoint_type waterpoint_type_group status_group
0 69572 6000.0 2011-03-14 Roman 1390 Roman 34.938093 -9.856322 none 0 ... soft good enough enough spring spring groundwater communal standpipe communal standpipe functional
1 8776 0.0 2013-03-06 Grumeti 1399 GRUMETI 34.698766 -2.147466 Zahanati 0 ... soft good insufficient insufficient rainwater harvesting rainwater harvesting surface communal standpipe communal standpipe functional
2 49056 0.0 2011-02-20 Private 62 Private 39.209518 -7.034139 Mzee Hokororo 0 ... salty salty enough enough machine dbh borehole groundwater other other functional
3 58155 0.0 2011-09-04 Unicef 1656 DWE 34.569266 -9.085515 Kwa Rose Chaula 0 ... soft good dry dry river river/lake surface communal standpipe communal standpipe non functional
4 34169 0.0 2011-07-22 Hesawa 1162 DWE 32.920154 -1.947868 Ngomee 0 ... milky milky insufficient insufficient spring spring groundwater other other functional needs repair

5 rows × 41 columns

In [259]:
df_pump_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8910 entries, 0 to 8909
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     8910 non-null   int64  
 1   amount_tsh             8910 non-null   float64
 2   date_recorded          8910 non-null   object 
 3   funder                 8369 non-null   object 
 4   gps_height             8910 non-null   int64  
 5   installer              8365 non-null   object 
 6   longitude              8910 non-null   float64
 7   latitude               8910 non-null   float64
 8   wpt_name               8910 non-null   object 
 9   num_private            8910 non-null   int64  
 10  basin                  8910 non-null   object 
 11  subvillage             8854 non-null   object 
 12  region                 8910 non-null   object 
 13  region_code            8910 non-null   int64  
 14  district_code          8910 non-null   int64  
 15  lga                    8910 non-null   object 
 16  ward                   8910 non-null   object 
 17  population             8910 non-null   int64  
 18  public_meeting         8422 non-null   object 
 19  recorded_by            8910 non-null   object 
 20  scheme_management      8302 non-null   object 
 21  scheme_name            4686 non-null   object 
 22  permit                 8445 non-null   object 
 23  construction_year      8910 non-null   int64  
 24  extraction_type        8910 non-null   object 
 25  extraction_type_group  8910 non-null   object 
 26  extraction_type_class  8910 non-null   object 
 27  management             8910 non-null   object 
 28  management_group       8910 non-null   object 
 29  payment                8910 non-null   object 
 30  payment_type           8910 non-null   object 
 31  water_quality          8910 non-null   object 
 32  quality_group          8910 non-null   object 
 33  quantity               8910 non-null   object 
 34  quantity_group         8910 non-null   object 
 35  source                 8910 non-null   object 
 36  source_type            8910 non-null   object 
 37  source_class           8910 non-null   object 
 38  waterpoint_type        8910 non-null   object 
 39  waterpoint_type_group  8910 non-null   object 
 40  status_group           8910 non-null   object 
dtypes: float64(3), int64(7), object(31)
memory usage: 2.8+ MB
  • NULL-Werte nur in Spalten, in der auch beim Train-Set NULL-Werte waren
  • Einsatz des Test-Sets sollte nach Transformation mit dem ColumnTransformer problemlos möglich sein
In [336]:
#Selbe Transformation wie auf den Trainingsdaten
df_test = df_pump_test.copy()

df_test["t_functional"] = df_test["status_group"].apply(lambda x: 0 if x=="non functional" else 1)
df_test["t_needs_rep"] = df_test["status_group"].apply(lambda x: 1 if x=="functional needs repair" else 0)

df_test = df_test.drop(df_test[df_test.longitude == 0].index)

df_test["amount_per_capita"] = df_test.apply(lambda x: x["amount_tsh"]/x["population"] if x["population"] != 0 else 0, axis=1)
df_test["age"]=df_test['date_recorded'].apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d'))-df_test["construction_year"].apply(lambda x: datetime.datetime.strptime(str(x),'%Y') if x!=0 else datetime.datetime.now())
df_test["age"].loc[df_test["construction_year"]!=0]=df_test["age"].loc[df_test["construction_year"]!=0].apply(lambda x: x.days)
df_test["age"].loc[df_test["construction_year"]==0]=np.nan

df_test["district_code"] = df_test["district_code"].astype("category")
df_test["region_code"] = df_test["region_code"].astype("category")

y_test = df_test.loc[:,'status_group']
le = LabelEncoder()
y_test = le.fit_transform(y_test)
y_test_f = df_test["t_functional"]
y_test_r = df_test["t_needs_rep"]
In [337]:
X_test = col_trans.transform(X=df_test)
X_test_lbl = col_trans_lbl.transform(X=df_test)
X_test_w_age = col_trans_w_age.transform(X=df_test)
X_test_lbl_w_age = col_trans_lbl_w_age.transform(X=df_test)
X_test_lbl_reduced = col_trans_lbl_reduced.transform(X=df_test)
X_test_u = col_trans_u.transform(X=df_test)
X_test_lbl_w_age_u = col_trans_lbl_w_age_u.transform(X=df_test)
X_test_lbl_reduced_u = col_trans_lbl_reduced_u.transform(X=df_test)
In [338]:
print(X_test.shape)
print(X_test_lbl.shape)
print(X_test_w_age.shape)
print(X_test_lbl_w_age.shape)
print(X_test_lbl_reduced.shape)
print(X_test_u.shape)
print(X_test_lbl_w_age_u.shape)
print(X_test_lbl_reduced_u.shape)
print(f'({X_test.min()}, {X_test.max()})')
print(f'({X_test_lbl.min()}, {X_test_lbl.max()})')
print(f'({X_test_w_age.min()}, {X_test_w_age.max()})')
print(f'({X_test_lbl_w_age.min()}, {X_test_lbl_w_age.max()})')
print(f'({X_test_lbl_reduced.min()}, {X_test_lbl_reduced.max()})')
print(f'({X_test_u.min()}, {X_test_u.max()})')
print(f'({X_test_lbl_w_age_u.min()}, {X_test_lbl_w_age_u.max()})')
print(f'({X_test_lbl_reduced_u.min()}, {X_test_lbl_reduced_u.max()})')
(8630, 95)
(8630, 25)
(8630, 96)
(8630, 26)
(8630, 16)
(8630, 92)
(8630, 26)
(8630, 16)
(-9.977452173104417e-05, 1.0)
(-9.977452173104417e-05, 1.0)
(-9.977452173104417e-05, 1.0)
(-9.977452173104417e-05, 1.0)
(0.0, 1.0)
(-0.0028318584070796474, 1.8115942028985506)
(-0.0028318584070796474, 1.8115942028985506)
(0.0, 1.8115942028985506)
In [342]:
from yellowbrick.classifier import ROCAUC

def plot_ROC(model, X_train, y_train, X_test, y_test):

    vis = ROCAUC(model, encoder={0: 'functional', 1: ' func_needs_rep', 2: 'non_functional'})
                                                                         
    vis.fit(X_train, y_train)
    vis.score(X_test, y_test)
    vis.show()
    
    return vis

from sklearn.metrics import plot_confusion_matrix

1. Prognosemodell: functional vs. non-functional¶

In [363]:
df_act_pred = pd.DataFrame(data=[], columns=["Model", "Actuals", "Predicted"])
In [364]:
plt.figure(0).clf()

#Tuned XGB
y_pred_f = xgb_cf_t.predict_proba(X_test)[:,1]
y_pred_f_bin = xgb_cf_t.predict(X_test)
fpr, tpr, thresh = metrics.roc_curve(y_test_f, y_pred_f)
auc = metrics.roc_auc_score(y_test_f, y_pred_f)

plt.plot(fpr,tpr,label="XG Boost Tuned, auc="+str(auc))

df_act_pred.loc[len(df_act_pred)] = ["XG Boost Tuned", y_test_f, y_pred_f_bin]


#tuned Random Forest
y_pred_f = rf_cf_t.predict_proba(X_test)[:,1]
y_pred_f_bin = rf_cf_t.predict(X_test)
fpr, tpr, thresh = metrics.roc_curve(y_test_f, y_pred_f)
auc = metrics.roc_auc_score(y_test_f, y_pred_f)
plt.plot(fpr,tpr,label="Tuned Random Forest Tuned, auc="+str(auc))

df_act_pred.loc[len(df_act_pred)] = ["Random Forest Tuned", y_test_f, y_pred_f_bin]

print("Random Forest Tuned")
print(classification_report(y_test_f, y_pred_f_bin))

#Tuned Decision Tree
y_pred_f = dt_cf_t.predict_proba(X_test)[:,1]
y_pred_f_bin = dt_cf_t.predict(X_test)
fpr, tpr, thresh = metrics.roc_curve(y_test_f, y_pred_f)
auc = metrics.roc_auc_score(y_test_f, y_pred_f)
plt.plot(fpr,tpr,label="Decision Tree Tuned, auc="+str(auc))

df_act_pred.loc[len(df_act_pred)] = ["Decision Tree Tuned", y_test_f, y_pred_f_bin]

plt.legend(loc=0)
plt.show()
Random Forest Tuned
              precision    recall  f1-score   support

           0       0.86      0.73      0.79      3364
           1       0.84      0.92      0.88      5266

    accuracy                           0.85      8630
   macro avg       0.85      0.83      0.83      8630
weighted avg       0.85      0.85      0.84      8630

Die AUC Kurve auf den Testdaten zeigt solide Ergebnisse

  • Der getunte Random Forest besitzt die höchste Güte
  • die anderen Modelle schneiden schlechter als der ungetunte XG Boost ab, besitzen aber eine solide AUC


Getunter Random Forest löst die 1. Fragestellung

In [367]:
fig, axes = plt.subplots(nrows = 1, ncols = df_act_pred.shape[0] , figsize=(df_act_pred.shape[0]*5,3))
for i, row in df_act_pred.iterrows():
    y_test_f = row['Actuals']
    y_pred_f_bin = row['Predicted']
    
    confusion_matrix = metrics.confusion_matrix(y_test_f, y_pred_f_bin)

    cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, display_labels = [False, True])
    cm_display.plot(ax = axes[i])
    cm_display.ax_.set_title(row['Model'])
plt.show()

Die Verteilung innerhalb der Konfusion Matrix zeigt das selbe Ergebnis

  • Der getunte Random Forest eignet sich am Besten

2. Prognosemodell: needs repair yes/no¶

In [357]:
lr_disp = RocCurveDisplay.from_estimator(lr_cf_tuned, X_test_u, y_test_r)
svm_disp = RocCurveDisplay.from_estimator(svm_cf_tuned, X_test_u, y_test_r, ax=lr_disp.ax_)
dt_disp = RocCurveDisplay.from_estimator(dt_cf_, X_test_u, y_test_r, ax=lr_disp.ax_)
rf_disp = RocCurveDisplay.from_estimator(rf_cf_tuned, X_test_u, y_test_r, ax=lr_disp.ax_)
rf_lbl_disp = RocCurveDisplay.from_estimator(rf_cf_lbl, X_test_lbl_w_age_u, y_test_r, ax=lr_disp.ax_)
xgb_disp = RocCurveDisplay.from_estimator(xgb_cf_tuned, X_test_u, y_test_r, ax=lr_disp.ax_)
lr_disp.figure_.suptitle("ROC curve comparison")
lr_disp.figure_.set_size_inches(8, 6)
plt.show()

Fazit Modellauswahl mit Testdaten: Bei der Evaluation mit den Testdaten wird klar, dass der Random Forest den höchsten AUC-Score liefern kann. Auch die SVM kann gut mithalten, allerdings ist sie aufgrund der "Trainingskosten" zusätzlich zu benachteiligen.

Welche Features waren wichtig für die Entscheidungen des Random Forest?

Für den Ordinal Encoding Random Forest kann der Einfluss von unterschiedlichen Features auf das Modell betrachtet werden. Es lassen sich ausgehend von im Business Understanding getroffenen Annahmen, welche Features vermutlich auf die Funktionalität eines Brunnens Einfluss haben könnten, einige Überraschungen feststellen.

In [368]:
feature_names = ['funder', 'installer', 'subvillage', 'lga', 'ward', 'management', 'basin', 'payment','water_quality', 'quantity',
                    'source','waterpoint_type', 'extraction_type_class', 'region', 'region_code', 'district_code', 'wpt_name',
                    "amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude', "scheme_management",
                    "date_recorded", "age"]
feature_scores = pd.Series(rf_cf_lbl.feature_importances_, index=feature_names).sort_values(ascending=False)
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.barplot(x=feature_scores, y=feature_scores.index)
ax.set_title("Feature Importance (Gain)")
ax.set_yticklabels(feature_scores.index)
ax.set_xlabel("Feature importance score")
ax.set_ylabel("Features")
plt.show()
In [371]:
from sklearn.metrics import confusion_matrix
predictions = rf_cf_tuned.predict(X_test_u)
cm = confusion_matrix(y_test_r, predictions, labels=rf_cf_tuned.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['needs_no_rep', 'func_needs_rep'])
disp.plot()
plt.title('RandomForestClassifier')
plt.show()
print('RandomForestClassifier: \n',classification_report(y_test_r, predictions))
RandomForestClassifier: 
               precision    recall  f1-score   support

           0       0.98      0.77      0.86      8055
           1       0.19      0.78      0.31       575

    accuracy                           0.77      8630
   macro avg       0.59      0.77      0.59      8630
weighted avg       0.93      0.77      0.82      8630

Mit dem Modell werden viele der reparaturbedürftigen Brunnen erkannt (nur 126 nicht, recall für 1 von 0.78). Dies geht jedoch auf Kosten der Precision für diese Klasse. Nachdem man jedoch tendenziell eher alle reparaturbedürftigen Brunnen erkennen möchte, ist dies annehmbar.

Fazit Evaluation gewähltes Modell: Wie bereits in der Data Exploration und in der Data Preparation klar wurde, besteht für die Ausprägung status_group = functional needs repair ein starkes Ungleichgewicht in den Daten. Aufgrund dessen kann hier im Vergleich zu Prognosemodell 1 kein gleich hoher AUC Score erreicht werden.

Aus wirtschaftlicher Sicht muss sich die Frage gestellt werden, ob eine Optimierung nach einem anderen Score (z.B. Recall, wenn möglichst alle wartungsbedürftigen Brunnen erkannt werden sollen) dem Ziel des Prognosemodells eher gerecht werden würde. Hier ist sind Aspekte wie der Precision-Recall Trade-off zu beachten.

3. Prognosemodell: functional vs. non-functional vs. functional_needs_repair¶

In [395]:
auc_test_rf = roc_auc_score(y_test, rf_cf_opt.predict_proba(X_test), multi_class='ovr')
auc_test_xgb = roc_auc_score(y_test, xgb_cf_opt.predict_proba(X_test), multi_class='ovr')
In [396]:
auc_test_rf_w = roc_auc_score(y_test, rf_cf_w_opt.predict_proba(X_test), multi_class='ovr')
auc_test_xgb_w = roc_auc_score(y_test, xgb_cf_w_opt.predict_proba(X_test), multi_class='ovr')
In [397]:
auc_test_rf_age = roc_auc_score(y_test, rf_cf_opt_age.predict_proba(X_test_w_age), multi_class='ovr')
auc_test_xgb_age = roc_auc_score(y_test, xgb_cf_opt_age.predict_proba(X_test_w_age), multi_class='ovr')
In [398]:
df_tuned["AUC Test"] = [auc_test_rf, auc_test_xgb, auc_test_xgb_w, auc_test_rf_w, auc_test_rf_age, auc_test_xgb_age]
df_tuned
Out[398]:
Model AUC Train AUC CV AUC Test
0 Random Forest - optimized - {'max_depth': 18, ... 0.992420 0.916088 0.899027
1 XGBoost - optimized - {'learning_rate': 0.1, '... 0.978028 0.926525 0.859736
2 XGBoost - weights -optimized {'learning_rate':... 0.978048 0.926460 0.863119
3 Random Forest - weights -optimized {'max_depth... 0.993145 0.915264 0.891627
4 Random Forest - with Age - optimized - {'max_d... 0.992760 0.916654 0.899959
5 XGBoost - with Age - optimized {'learning_rate... 0.983787 0.927360 0.863010
In [399]:
plot_ROC(rf_cf_opt_age, X_train_w_age, y_train, X_test_w_age, y_test);
In [402]:
fig = plt.figure(figsize=(14,5))
ax1 = fig.add_subplot(121)
plot_confusion_matrix(rf_cf_opt, X_test, y_test, ax=ax1)
ax2 = fig.add_subplot(122)
plot_confusion_matrix(rf_cf_w_opt, X_test, y_test, ax=ax2)
plt.show()
  • die XGBoost-Modelle brechen auf dem Test-Set in ihrer Performance stark ein im Vergleich zu den Random-Forest-Modellen, so dass diese die beste AUC aufweisen
  • am besten ist der Random Forest, der das Alter des Brunnens als zusätzliches Input feature bekommen hat
  • durch Gewichtung der Klassen verändert sich die Zusammensetzung der Konfusions-Matrix etwas, der AUC ist jedoch minimal schlechter

Deployment¶

Das Deployment der Modelle ist nicht mehr Teil der Aufgabenstellung. Im Falle eines Deployments würden jedoch folgende Prognosemodelle für die 3 Fälle ausgewählt werden:

  • Prognosemodell 1: rf_cf_t (Tuned Random Forest, Daten mit col_trans transformiert)
  • Prognosemodell 2: rf_cf_tuned (Tuned Random Forest, Daten mit col_trans transformiert)
  • Prognosemodell 3: rf_cf_opt_age (Tuned Random Forest, Daten mit col_trans_w_age transformiert)

Allen drei Modellen liegt ein Random Forest zugrunde. Aufgrund der vielen kategoriellen Merkmale scheint diese Algorithmen-Wahl jedoch nicht verwunderlich.

Quellen:¶

https://towardsdatascience.com/the-easiest-way-to-plot-data-from-pandas-on-a-world-map-1a62962a27f3

https://www.scikit-yb.org/en/latest/api/classifier/rocauc.html#multi-class-rocauc-curves

https://plotly.com/python/scattermapbox/

https://imbalanced-learn.org/stable/under_sampling.html

https://contrib.scikit-learn.org/category_encoders/index.html

https://towardsdatascience.com/why-weight-the-importance-of-training-on-balanced-datasets-f1e54688e7df